This chapter covers data analysis of public bike sharing systems around the world. It’s the first of a few chapters covering different means of transport, such as airplanes, buses and cars. The idea of this project is to apply data analysis techniques to real-world data sets. In this first chapter I will walk through a full data analysis cycle, starting with gathering data and ending with a prediction model of public bike sharing capacity in London.
This document is written in Markdown and then rendered to HTML for publishing on the Internet. The workflow is split into three parts: gathering, storing and modeling data.
The following code comes from get_citybik.R. To gather data, I will use the CityBikes API (www.citybik.es), which gives real-time access to data from dozens of public bike sharing systems around the world.
First we need to set the working directory. Remember to change it if you are working with these data on your own machine.
# 1.-GOAL: Get list of bikes systems and complete with full URL
# Load jsonlite package
library(jsonlite)
# Set working directory (Remember to change this directory on demand)
#setwd("/home/albertogonzalez/Desktop/TDA/chapters/bikes")
Get a list of all networks using the network endpoint. Basically I want to know which bike systems are available using this endpoint.
# Get all networks involved
network_endpoint = "http://api.citybik.es/v2/networks"
network_endpoint_df <- jsonlite::fromJSON(network_endpoint, simplifyDataFrame = TRUE)
Extract the useful fields from the list of networks and add a column with the full endpoint URL.
# Refine endpoint output
company_names = network_endpoint_df$networks
company_ids = company_names$id
company_data = company_names$location
company_url = company_names$href
# Build a suffix to endpoint
suffix = "http://api.citybik.es"
company_url = paste(suffix,company_names$href,sep="")
company_df = cbind(company_ids,company_data,company_url)
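As a quick standalone sanity check (example hrefs, not a live API call), `paste()` with `sep = ""` simply concatenates the host and each href, recycling the host across the vector:

```r
# paste() is vectorized: the single host is recycled against the href vector
suffix <- "http://api.citybik.es"
hrefs  <- c("/v2/networks/opole-bike", "/v2/networks/veturilo")
full_urls <- paste(suffix, hrefs, sep = "")
full_urls
# "http://api.citybik.es/v2/networks/opole-bike"
# "http://api.citybik.es/v2/networks/veturilo"
```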
This is the resulting table:
head(company_df)
## company_ids city country latitude longitude
## 1 opole-bike Opole PL 50.66450 17.92760
## 2 wroclawski-rower-miejski Wrocław PL 51.10970 17.04850
## 3 veturilo Warszawa PL 52.23450 21.00240
## 4 bubi Budapest HU 47.49791 19.04023
## 5 kvb-rad Köln DE 50.93640 6.96053
## 6 bemowo-bike Warszawa PL 52.26060 20.92900
## company_url
## 1 http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3 http://api.citybik.es/v2/networks/veturilo
## 4 http://api.citybik.es/v2/networks/bubi
## 5 http://api.citybik.es/v2/networks/kvb-rad
## 6 http://api.citybik.es/v2/networks/bemowo-bike
Export the previous data frame to a CSV file in case we need this information later.
# Export list of companies
write.csv(company_df,file = "list_of_bikes_system_worldwide.csv",row.names = FALSE)
Now I am going to create a function to get data from all our systems.
I will start with one system and then apply the same function to a set of cities, although you could run it for the 400+ cities included in this magnificent API.
# 2.- For each system, get data (build DB and cron job)
# Build a function to read, parse and store endpoint data
bike_input = "http://api.citybik.es/v2/networks/wroclawski-rower-miejski"
# Read one system
bike_input_df = jsonlite::fromJSON(bike_input, simplifyDataFrame = TRUE)
bike_input_df_2 = bike_input_df$network
bike_input_df_3 = bike_input_df_2$stations
bike_df = bike_input_df_3[c(1,3:8)]
total_slots = bike_input_df_3$extra$slots
bike_df$total_slots = total_slots
The resulting table of capacity data in a given city:
head(bike_df)
## empty_slots free_bikes id latitude
## 1 15 0 52515b944e4a64584def861922aa6db1 51.12528
## 2 15 0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208
## 3 13 0 0138e16d65d5db5b6e15970b73686b1a 51.10723
## 4 15 0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098
## 5 15 0 ae5d14252719f50d17084268b9906747 51.09804
## 6 15 0 d56c50f2935797bcffcba9f4237ded56 51.11943
## longitude name
## 1 16.98439 Legnicka - Wejherowska
## 2 17.06550 Pl. Kromera
## 3 17.06133 Politechnika Wrocławska - Gmach Główny
## 4 17.00830 Pereca - Grabiszyńska
## 5 17.00680 Żelazna - Pereca
## 6 17.05151 Wyszyńskiego - Prusa
## timestamp total_slots
## 1 2016-12-31T09:23:28.508000Z 15
## 2 2016-12-31T09:23:28.442000Z 15
## 3 2016-12-31T09:23:28.471000Z 14
## 4 2016-12-31T09:23:28.500000Z 15
## 5 2016-12-31T09:23:28.507000Z 15
## 6 2016-12-31T09:23:28.534000Z 15
You can see some of the cities included in the API.
# Get id + url
input_array = company_df[c(1,6)]
head(input_array)
## company_ids
## 1 opole-bike
## 2 wroclawski-rower-miejski
## 3 veturilo
## 4 bubi
## 5 kvb-rad
## 6 bemowo-bike
## company_url
## 1 http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3 http://api.citybik.es/v2/networks/veturilo
## 4 http://api.citybik.es/v2/networks/bubi
## 5 http://api.citybik.es/v2/networks/kvb-rad
## 6 http://api.citybik.es/v2/networks/bemowo-bike
I will build two arrays with the key information needed to build the final query: the system id and its API URL.
array_id = input_array$company_ids
array_id = as.data.frame(array_id)
array_url = input_array$company_url
array_url = as.data.frame(array_url)
head(array_id)
## array_id
## 1 opole-bike
## 2 wroclawski-rower-miejski
## 3 veturilo
## 4 bubi
## 5 kvb-rad
## 6 bemowo-bike
head(array_url)
## array_url
## 1 http://api.citybik.es/v2/networks/opole-bike
## 2 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 3 http://api.citybik.es/v2/networks/veturilo
## 4 http://api.citybik.es/v2/networks/bubi
## 5 http://api.citybik.es/v2/networks/kvb-rad
## 6 http://api.citybik.es/v2/networks/bemowo-bike
As previously shown, let’s create a function to gather data from one city, and then apply that same function to more than one city.
# Create the function for one system
foo = function(url,id){
input = as.character(url)
read_input = jsonlite::fromJSON(input, simplifyDataFrame = TRUE)
input_df_2 = read_input$network
input_df_3 = input_df_2$stations
input_df = input_df_3[c(1,3:8)]
input_df$company_id = as.character(id)
return (input_df)
}
Now it’s time to test the function on one city, then on a small number of cities and ultimately, if you wish, on the whole array of cities.
# Test function for one city:
aa = array_url[2, ]
aa
## [1] http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 458 Levels: http://api.citybik.es/v2/networks/adcb-bikeshare ...
a = foo(aa,"wroclawski-rower-miejski")
head(a)
## empty_slots free_bikes id latitude
## 1 15 0 52515b944e4a64584def861922aa6db1 51.12528
## 2 15 0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208
## 3 13 0 0138e16d65d5db5b6e15970b73686b1a 51.10723
## 4 15 0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098
## 5 15 0 ae5d14252719f50d17084268b9906747 51.09804
## 6 15 0 d56c50f2935797bcffcba9f4237ded56 51.11943
## longitude name
## 1 16.98439 Legnicka - Wejherowska
## 2 17.06550 Pl. Kromera
## 3 17.06133 Politechnika Wrocławska - Gmach Główny
## 4 17.00830 Pereca - Grabiszyńska
## 5 17.00680 Żelazna - Pereca
## 6 17.05151 Wyszyńskiego - Prusa
## timestamp company_id
## 1 2016-12-31T09:23:28.508000Z wroclawski-rower-miejski
## 2 2016-12-31T09:23:28.442000Z wroclawski-rower-miejski
## 3 2016-12-31T09:23:28.471000Z wroclawski-rower-miejski
## 4 2016-12-31T09:23:28.500000Z wroclawski-rower-miejski
## 5 2016-12-31T09:23:28.507000Z wroclawski-rower-miejski
## 6 2016-12-31T09:23:28.534000Z wroclawski-rower-miejski
The next chunk is meant to apply the foo function (get data from one system) to the 452 systems available in CityBikes.
# Apply function to all the bike systems
# u = array_url[1:452,]
# i = array_id[1:452,]
#
# foo_all <- lapply(u,foo,i)
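One caveat about the commented chunk above: `lapply(u, foo, i)` would pass the whole `i` vector as the `id` argument of every call. `Map()` iterates over both vectors in parallel, pairing each URL with its own id. A minimal sketch with a stand-in function (`fake_foo` is hypothetical, used only to check the pairing without network calls):

```r
u <- c("http://api.citybik.es/v2/networks/dublinbikes",
       "http://api.citybik.es/v2/networks/veturilo")
i <- c("dublin", "veturilo")

# Hypothetical stand-in for foo(), so the pairing can be checked offline
fake_foo <- function(url, id) {
  data.frame(url = url, company_id = id, stringsAsFactors = FALSE)
}

paired   <- Map(fake_foo, u, i)      # one data frame per system
combined <- do.call(rbind, paired)   # stack into a single table
combined$company_id                  # "dublin" "veturilo", matched row by row
```

With the real function, the same pattern would be `foo_all <- Map(foo, u, i)`.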
The chunks above show how to work with the API. From here on I will gather real-time data for eight European cities every hour and store it in an SQLite database.
Using the function created before, I first test a single city.
# Test one city.
u = "http://api.citybik.es/v2/networks/dublinbikes"
i = "dublin"
city_ind = foo(u,i)
head(city_ind)
## empty_slots free_bikes id latitude
## 1 19 16 b1c774f178240d5d757854364dafc840 53.34143
## 2 28 2 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3 38 2 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4 2 28 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5 29 11 6bff19daf1747a1e956a899ef8021232 53.34071
## 6 12 23 11b359924d83d802a41856c405d13e61 53.33878
## longitude name timestamp company_id
## 1 -6.246720 FENIAN STREET 2016-12-31T09:22:10.080000Z dublin
## 2 -6.246154 CITY QUAY 2016-12-31T09:22:10.111000Z dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-31T09:22:10.216000Z dublin
## 4 -6.300217 BROOKFIELD ROAD 2016-12-31T09:22:10.052000Z dublin
## 5 -6.308191 EMMET ROAD 2016-12-31T09:22:10.219000Z dublin
## 6 -6.303950 ROTHE ABBEY 2016-12-31T09:22:10.214000Z dublin
Then I create the SQLite database (just the first time) and store the results for future analysis.
# Create a DB (sqlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create DB & Table
# my_db <- src_sqlite( "bikes.sqlite3", create = TRUE) # create DB
# copy_to( my_db, city_ind, "bikes_records_2", temporary = FALSE) # create table
Set up a cron job to get and store data for the selected list of systems with the following code (see store_citybik.R).
The following chunk shows the case of one city (Dublin), but if you look at store_citybik.R you will see how the cron job gathers data from eight different cities.
# libraries
library(RSQLite)
#setwd("/home/albertogonzalez/Desktop/bestiario/Quadrigram/2016/monografias/bikes")
# Open DB connection
con <- dbConnect(SQLite(),dbname="bikes.sqlite3")
my_db <- src_sqlite( "bikes.sqlite3", create = FALSE)
# Function to get data
foo = function(url,id){
input = as.character(url)
read_input = jsonlite::fromJSON(input, simplifyDataFrame = TRUE)
input_df_2 = read_input$network
input_df_3 = input_df_2$stations
input_df = input_df_3[c(1,3:8)]
input_df$company_id = as.character(id)
return (input_df)
}
# DUBLIN
u = "http://api.citybik.es/v2/networks/dublinbikes"
i = "dublin"
city_ind = foo(u,i)
dim(city_ind)
## [1] 101 8
head(city_ind)
## empty_slots free_bikes id latitude
## 1 19 16 b1c774f178240d5d757854364dafc840 53.34143
## 2 28 2 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3 38 2 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4 2 28 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5 29 11 6bff19daf1747a1e956a899ef8021232 53.34071
## 6 12 23 11b359924d83d802a41856c405d13e61 53.33878
## longitude name timestamp company_id
## 1 -6.246720 FENIAN STREET 2016-12-31T09:22:10.080000Z dublin
## 2 -6.246154 CITY QUAY 2016-12-31T09:22:10.111000Z dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-31T09:22:10.216000Z dublin
## 4 -6.300217 BROOKFIELD ROAD 2016-12-31T09:22:10.052000Z dublin
## 5 -6.308191 EMMET ROAD 2016-12-31T09:22:10.219000Z dublin
## 6 -6.303950 ROTHE ABBEY 2016-12-31T09:22:10.214000Z dublin
Once you have the data, you need to insert it into the database.
# Append rows to DB:
db_insert_into( con = my_db$con, table = "bikes_records_2", values = city_ind) # append rows
## [1] TRUE
The final part of this process comes from model_bikes.R. I have been gathering data from those eight cities for a couple of months, so I now have a rich dataset, in reality a bunch of datasets, that can be used to analyze any of those European cities.
library(jsonlite)
library(RSQLite)
library(lubridate)
library(tidyr)
library(ggmap)
library(ggthemes)
library(dplyr)
The first thing to do is to connect to the database where I stored data and get it all.
# Open connection
con <- dbConnect(SQLite(),dbname="bikes.sqlite3")
my_db <- src_sqlite( "bikes.sqlite3", create = FALSE)
# Get all data
all_data = dbGetQuery(con,"SELECT * FROM bikes_records_2")
str(all_data)
## 'data.frame': 1075613 obs. of 8 variables:
## $ empty_slots: int 15 3 8 27 36 34 30 17 40 39 ...
## $ free_bikes : int 20 26 32 0 4 1 0 3 0 1 ...
## $ id : chr "b1c774f178240d5d757854364dafc840" "135d1f3410c5dbcc49c55a1d522dbbe6" "153ff4dfb7bd8912ef91c10849129c2e" "516ee41c56889c29b8fa5f4ac0e824f4" ...
## $ latitude : num 53.3 53.3 53.3 53.3 53.3 ...
## $ longitude : num -6.25 -6.25 -6.25 -6.3 -6.31 ...
## $ name : chr "FENIAN STREET" "CITY QUAY" "FITZWILLIAM SQUARE EAST" "BROOKFIELD ROAD" ...
## $ timestamp : chr "2016-10-17T12:47:15.653000Z" "2016-10-17T12:47:15.673000Z" "2016-10-17T12:47:15.755000Z" "2016-10-17T12:47:15.625000Z" ...
## $ company_id : chr "dublin" "dublin" "dublin" "dublin" ...
head(all_data)
## empty_slots free_bikes id latitude
## 1 15 20 b1c774f178240d5d757854364dafc840 53.34143
## 2 3 26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3 8 32 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4 27 0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5 36 4 6bff19daf1747a1e956a899ef8021232 53.34071
## 6 34 1 11b359924d83d802a41856c405d13e61 53.33878
## longitude name timestamp company_id
## 1 -6.246720 FENIAN STREET 2016-10-17T12:47:15.653000Z dublin
## 2 -6.246154 CITY QUAY 2016-10-17T12:47:15.673000Z dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z dublin
## 4 -6.300217 BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z dublin
## 5 -6.308191 EMMET ROAD 2016-10-17T12:47:15.766000Z dublin
## 6 -6.303950 ROTHE ABBEY 2016-10-17T12:47:15.744000Z dublin
Let’s apply some time transformations.
# Insert dates columns
#all_data$timestamp = as.Date(all_data$timestamp)
all_data$time_2 = ymd_hms(all_data$timestamp)
all_data$week_day = wday(all_data$time_2,label = TRUE)
all_data$hour = hour(all_data$time_2)
all_data$shift = cut(all_data$hour,c(0,7,15,23),c("night","morning","afternoon"))
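One detail worth knowing about `cut()`: with the default right-closed intervals, the first bin is (0, 7], so hour 0 (midnight) falls outside it and gets NA. If you want midnight counted as night, `include.lowest = TRUE` closes the first interval:

```r
hours <- c(0, 6, 7, 8, 15, 16, 23)

# Default: first interval is (0, 7], so hour 0 becomes NA
shift_default <- cut(hours, c(0, 7, 15, 23),
                     c("night", "morning", "afternoon"))

# include.lowest = TRUE makes the first interval [0, 7]
shift_fixed <- cut(hours, c(0, 7, 15, 23),
                   c("night", "morning", "afternoon"),
                   include.lowest = TRUE)
# shift_fixed[1] is "night" instead of NA
```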
We can define a capacity status in terms of how many bikes are in use at a given station. We say a station is in overflow when there are few empty slots left to park. As far as I have seen in other similar analyses, the thresholds used to score a station as overflow, balanced or shortage are subjective. In this analysis, stations with 0% to 25% empty slots are scored as overflow, stations with 75% to 100% empty slots are scored as shortage, and everything in between is balanced.
# Total slots and shifts calculations
all_data$total_slots = all_data$empty_slots + all_data$free_bikes
all_data$empty_status = (all_data$empty_slots / all_data$total_slots) * 100
all_data$empty_status_label = cut(all_data$empty_status,c(0,25,75,100),c("overflow","balanced","shortage"))
all_data$empty_status_label[all_data$empty_status==0] <- "overflow" # cut() leaves 0 outside (0,25], so label 0% explicitly
head(all_data)
## empty_slots free_bikes id latitude
## 1 15 20 b1c774f178240d5d757854364dafc840 53.34143
## 2 3 26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664
## 3 8 32 153ff4dfb7bd8912ef91c10849129c2e 53.33521
## 4 27 0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901
## 5 36 4 6bff19daf1747a1e956a899ef8021232 53.34071
## 6 34 1 11b359924d83d802a41856c405d13e61 53.33878
## longitude name timestamp company_id
## 1 -6.246720 FENIAN STREET 2016-10-17T12:47:15.653000Z dublin
## 2 -6.246154 CITY QUAY 2016-10-17T12:47:15.673000Z dublin
## 3 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z dublin
## 4 -6.300217 BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z dublin
## 5 -6.308191 EMMET ROAD 2016-10-17T12:47:15.766000Z dublin
## 6 -6.303950 ROTHE ABBEY 2016-10-17T12:47:15.744000Z dublin
## time_2 week_day hour shift total_slots empty_status
## 1 2016-10-17 12:47:15 Mon 12 morning 35 42.85714
## 2 2016-10-17 12:47:15 Mon 12 morning 29 10.34483
## 3 2016-10-17 12:47:15 Mon 12 morning 40 20.00000
## 4 2016-10-17 12:47:15 Mon 12 morning 27 100.00000
## 5 2016-10-17 12:47:15 Mon 12 morning 40 90.00000
## 6 2016-10-17 12:47:15 Mon 12 morning 35 97.14286
## empty_status_label
## 1 balanced
## 2 overflow
## 3 overflow
## 4 shortage
## 5 shortage
## 6 shortage
Now it’s time to focus on a single city, in this case, London.
# One City
ci = "london" # line 38 in model bikes.R
ci_df = all_data %>%
filter(company_id == ci)
dim(ci_df)
## [1] 274291 15
ci_df = na.omit(ci_df)
dim(ci_df)
## [1] 271950 15
head(ci_df)
## empty_slots free_bikes id latitude
## 1 5 11 d79e95513c3b5e1d3c78a65874054481 51.49887
## 2 20 0 090f297e79c030688a8a51755ebeafe3 51.50172
## 3 9 9 7f3020118e56165ed8b2f61899edb971 51.52916
## 4 13 22 67e6c16bce05410ba4b1f0f5000726ea 51.49961
## 5 15 8 d0ca154129878ce71261d563d8da9238 51.53006
## 6 7 19 8e74e96bcc8d74d7621f8201cee921dd 51.49313
## longitude name
## 1 -0.1374246 200178 - Buckingham Gate, Westminster
## 2 -0.1798542 001231 - Queen's Gate, Kensington Gardens
## 3 -0.1099705 001023 - River Street , Clerkenwell
## 4 -0.1975742 001018 - Phillimore Gardens, Kensington
## 5 -0.1209737 001013 - St. Chad's Street, King's Cross
## 6 -0.1568760 003420 - Sedding Street, Sloane Square
## timestamp company_id time_2 week_day hour
## 1 2016-10-17T14:08:07.773000Z london 2016-10-17 14:08:07 Mon 14
## 2 2016-10-17T14:08:06.788000Z london 2016-10-17 14:08:06 Mon 14
## 3 2016-10-17T14:08:04.972000Z london 2016-10-17 14:08:04 Mon 14
## 4 2016-10-17T14:08:04.980000Z london 2016-10-17 14:08:04 Mon 14
## 5 2016-10-17T14:08:04.989000Z london 2016-10-17 14:08:04 Mon 14
## 6 2016-10-17T14:08:04.993000Z london 2016-10-17 14:08:04 Mon 14
## shift total_slots empty_status empty_status_label
## 1 morning 16 31.25000 balanced
## 2 morning 20 100.00000 shortage
## 3 morning 18 50.00000 balanced
## 4 morning 35 37.14286 balanced
## 5 morning 23 65.21739 balanced
## 6 morning 26 26.92308 balanced
Mean capacity by weekday, hour and station.
## Group by weekday, hour and station - mean of used_capacity
ci_df_2 = ci_df %>%
group_by(week_day,hour,name) %>%
summarise(avg_used_capacity = mean(empty_status, na.rm=TRUE)) %>%
arrange(desc(avg_used_capacity))
ci_df_ref = as.data.frame(ci_df_2)
head(ci_df_ref)
## week_day hour name
## 1 Sun 8 000970 - Scala Street, Fitzrovia
## 2 Sun 8 000999 - Queen Street 1, Bank
## 3 Sun 8 001006 - Bayley Street , Bloomsbury
## 4 Sun 8 001016 - Hereford Road, Bayswater
## 5 Sun 8 001043 - Museum of London, Barbican
## 6 Sun 8 001048 - Queen Victoria Street, St. Paul's
## avg_used_capacity
## 1 100
## 2 100
## 3 100
## 4 100
## 5 100
## 6 100
We can apply a simple ranking to keep the top and bottom N stations:
# Top and bottom stations
top_stations_weekday = ci_df_ref[c(1:250),]
bottom_stations_weekday = tail(ci_df_ref,250)
Although it’s simpler to get the top and bottom used stations by grouping by station alone:
## Group by station - mean of used_capacity
ci_df_3 = ci_df %>%
group_by(name) %>%
summarise(avg_used_capacity = mean(empty_status, na.rm=TRUE)) %>%
arrange(desc(avg_used_capacity))
ci_df_3_ref = as.data.frame(ci_df_3)
head(ci_df_3_ref)
## name avg_used_capacity
## 1 300079 - Manfred Road, East Putney 91.54527
## 2 200177 - Santos Road, Wandsworth 91.00370
## 3 300046 - Alma Road, Wandsworth 90.65369
## 4 200174 - Mostyn Grove, Bow 89.82960
## 5 300026 - All Saints' Road, Portobello 89.63444
## 6 200063 - Mexfield Road, East Putney 89.46055
See top and bottom used stations:
# Top and Bottom stations
top_stations = ci_df_3_ref[c(1:10),]
top_stations
## name avg_used_capacity
## 1 300079 - Manfred Road, East Putney 91.54527
## 2 200177 - Santos Road, Wandsworth 91.00370
## 3 300046 - Alma Road, Wandsworth 90.65369
## 4 200174 - Mostyn Grove, Bow 89.82960
## 5 300026 - All Saints' Road, Portobello 89.63444
## 6 200063 - Mexfield Road, East Putney 89.46055
## 7 200220 - Blythe Road West, Shepherd's Bush 88.22927
## 8 200186 - Nantes Close, Wandsworth 88.12582
## 9 200219 - Osiers Road, Wandsworth 88.06595
## 10 200098 - Aintree Street, Fulham 87.16813
bottom_stations = tail(ci_df_3_ref,10)
bottom_stations
## name avg_used_capacity
## 769 022183 - Crinan Street, King's Cross 26.13413
## 770 300234 - Stratford Station, Stratford 25.49022
## 771 001114 - Pont Street, Knightsbridge 25.47857
## 772 002695 - Cadogan Gardens, Chelsea 24.60526
## 773 200238 - Westferry DLR, Limehouse 24.47227
## 774 200050 - Naval Row, Blackwall 24.27386
## 775 200193 - Snowsfields, London Bridge 23.56908
## 776 300022 - Evesham Street, Avondale 21.96792
## 777 200118 - Parkway, Camden Town 21.89531
## 778 003420 - Sedding Street, Sloane Square 18.56592
If we want a better view, we can attach each station’s coordinates to these rankings.
## Geolocate main stations
ci_df_geo = ci_df[,c("name","latitude","longitude")]
ci_df_geo_uni = unique(ci_df_geo)
top_stations_geo = merge(top_stations,ci_df_geo_uni,by = "name")
bottom_stations_geo = merge(bottom_stations,ci_df_geo_uni,by = "name")
head(top_stations_geo)
## name avg_used_capacity latitude
## 1 200063 - Mexfield Road, East Putney 89.46055 51.45682
## 2 200098 - Aintree Street, Fulham 87.16813 51.48102
## 3 200174 - Mostyn Grove, Bow 89.82960 51.53054
## 4 200177 - Santos Road, Wandsworth 91.00370 51.45706
## 5 200186 - Nantes Close, Wandsworth 88.12582 51.46193
## 6 200219 - Osiers Road, Wandsworth 88.06595 51.46067
## longitude
## 1 -0.2028021
## 2 -0.2099735
## 3 -0.0254920
## 4 -0.2008063
## 5 -0.1807918
## 6 -0.1987354
head(bottom_stations_geo)
## name avg_used_capacity latitude
## 1 001114 - Pont Street, Knightsbridge 25.47857 51.49689
## 2 002695 - Cadogan Gardens, Chelsea 24.60526 51.49246
## 3 003420 - Sedding Street, Sloane Square 18.56592 51.49313
## 4 022183 - Crinan Street, King's Cross 26.13413 51.53447
## 5 200050 - Naval Row, Blackwall 24.27386 51.50898
## 6 200118 - Parkway, Camden Town 21.89531 51.53807
## longitude
## 1 -0.1612038
## 2 -0.1599190
## 3 -0.1568760
## 4 -0.1222030
## 5 -0.0069900
## 6 -0.1446640
Now that the top and bottom stations have coordinates, we can plot them on a map.
m <- get_map("London",zoom=12,maptype="terrain-lines",source="stamen")
ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = top_stations_geo)) + geom_point(color="green") + theme_minimal() + ggtitle("Top Stations")
ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = bottom_stations_geo)) + geom_point(color="red") + theme_minimal() + ggtitle("Bottom Stations")
## Warning: Removed 2 rows containing missing values (geom_point).
We can also plot the top and bottom stations by weekday and hour.
top_stations_weekday_geo = merge(top_stations_weekday,ci_df_geo_uni,by = "name")
bottom_stations_weekday_geo = merge(bottom_stations_weekday,ci_df_geo_uni,by = "name")
head(top_stations_weekday_geo)
## name week_day hour avg_used_capacity
## 1 000967 - Golden Lane, Barbican Sun 13 100
## 2 000967 - Golden Lane, Barbican Sun 9 100
## 3 000969 - Gower Place , Euston Mon 18 100
## 4 000969 - Gower Place , Euston Tues 17 100
## 5 000970 - Scala Street, Fitzrovia Sun 8 100
## 6 000974 - Guilford Street , Bloomsbury Mon 18 100
## latitude longitude
## 1 51.52175 -0.09447507
## 2 51.52175 -0.09447507
## 3 51.52523 -0.13518856
## 4 51.52523 -0.13518856
## 5 51.51991 -0.13603967
## 6 51.52335 -0.12020261
head(bottom_stations_weekday_geo)
## name week_day hour avg_used_capacity
## 1 000960 - Hop Exchange, The Borough Sat 13 0
## 2 000960 - Hop Exchange, The Borough Sat 14 0
## 3 000962 - Stamford Street, South Bank Sat 15 0
## 4 000964 - Bath Street, St. Luke's Fri 22 0
## 5 000965 - Tachbrook Street, Victoria Fri 5 0
## 6 000969 - Gower Place , Euston Sat 13 0
## latitude longitude
## 1 51.50463 -0.09177378
## 2 51.50463 -0.09177378
## 3 51.50557 -0.11160670
## 4 51.52589 -0.09084776
## 5 51.49211 -0.13836485
## 6 51.52523 -0.13518856
ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = top_stations_weekday_geo)) + facet_grid(week_day ~ hour) + geom_point(color="green") + theme_minimal() + ggtitle("Top Stations by weekday and hour")
## Warning: Removed 7 rows containing missing values (geom_point).
ggmap(m, base_layer = ggplot(aes(x = longitude, y = latitude), data = bottom_stations_weekday_geo)) + facet_grid( week_day ~ hour) + geom_point(color="red") + theme_minimal() + ggtitle("Bottom Stations by weekday and hour")
## Warning: Removed 12 rows containing missing values (geom_point).
ci_df_3_geo = merge(ci_df_3,ci_df_geo_uni,by = "name")
colnames(ci_df_3_geo) = c("name","avgcapacity","latitude","longitude")
Adding up the empty slots across stations gives a good proxy for the total number of bikes in use.
## Number of total bikes being used
circulando = ci_df %>%
group_by(hour) %>%
summarise(bicis_circulando = sum(empty_slots, na.rm=TRUE)) %>%
arrange(desc(bicis_circulando))
circulando_df = as.data.frame(circulando)
circulando_df$city = ci
head(circulando_df)
## hour bicis_circulando city
## 1 10 430339 london
## 2 8 420704 london
## 3 11 403440 london
## 4 12 381153 london
## 5 9 379578 london
## 6 13 379430 london
And then we can plot it.
library(scales)
cir_plot = ggplot(circulando_df,aes(as.factor(hour),bicis_circulando)) + geom_point() + theme_minimal() + ggtitle("Total Bikes in movement by hour")
cir_plot + scale_y_continuous(labels = comma)
We can also incorporate population data.
# Incorporate city populations
# list of cities
cities = as.data.frame(unique(all_data$company_id))
str(cities)
## 'data.frame': 8 obs. of 1 variable:
## $ unique(all_data$company_id): Factor w/ 8 levels "barcelona","budapest",..: 3 4 1 5 7 8 6 2
names(cities) = c("city")
cities_pop = as.data.frame(c(553165,630752,1604555,14025646,1347707,2249975,8673713,1759407))
names(cities_pop) = "population"
cities_df = cbind(cities,cities_pop)
head(cities_df)
## city population
## 1 dublin 553165
## 2 helsinki 630752
## 3 barcelona 1604555
## 4 istanbul 14025646
## 5 milano 1347707
## 6 paris 2249975
ggplot(cities_df,aes(city,population)) + geom_point() + theme_minimal() + scale_y_continuous(labels = comma)
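Note that the `cbind()` above pairs each city with its population purely by position, so it depends on the order `unique()` happens to return. A named lookup (using the same population figures as above) is more robust against reordering; this is a sketch, not part of the chapter’s scripts:

```r
# Populations keyed by company_id; same figures as in cities_pop above
pop_lookup <- c(dublin = 553165, helsinki = 630752, barcelona = 1604555,
                istanbul = 14025646, milano = 1347707, paris = 2249975,
                london = 8673713, budapest = 1759407)

city_ids <- c("london", "dublin")   # any subset, in any order
pop_df <- data.frame(city = city_ids,
                     population = unname(pop_lookup[city_ids]),
                     stringsAsFactors = FALSE)
pop_df   # each city gets its own population regardless of ordering
```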
And then add a per_mil_capita metric (bikes in use per thousand inhabitants).
cities_df_pop = merge(circulando_df,cities_df,by = "city",all.y = FALSE)
cities_df_pop$popu_mil = cities_df_pop$population/1000
cities_df_pop$per_mil_capita = cities_df_pop$bicis_circulando /
cities_df_pop$popu_mil
head(cities_df_pop)
## city hour bicis_circulando population popu_mil per_mil_capita
## 1 london 10 430339 8673713 8673.713 49.61416
## 2 london 8 420704 8673713 8673.713 48.50333
## 3 london 11 403440 8673713 8673.713 46.51295
## 4 london 12 381153 8673713 8673.713 43.94346
## 5 london 9 379578 8673713 8673.713 43.76188
## 6 london 13 379430 8673713 8673.713 43.74482
ggplot(cities_df_pop,aes(hour,per_mil_capita)) + geom_point() + theme_minimal() + ggtitle("Bikes per mil capita and hour")
# Average capacity used per hour
ave_per_hour = ci_df %>%
group_by(hour) %>%
summarise(ave_cap_h = mean(empty_status,na.rm = TRUE)) %>%
arrange(desc(ave_cap_h))
ave_per_hour = as.data.frame(ave_per_hour)
cities_df_ave = merge(cities_df_pop,ave_per_hour,by = "hour")
head(cities_df_ave)
## hour city bicis_circulando population popu_mil per_mil_capita
## 1 1 london 33592 8673713 8673.713 3.872851
## 2 2 london 33583 8673713 8673.713 3.871814
## 3 3 london 33618 8673713 8673.713 3.875849
## 4 4 london 33670 8673713 8673.713 3.881844
## 5 5 london 34156 8673713 8673.713 3.937875
## 6 6 london 47064 8673713 8673.713 5.426050
## ave_cap_h
## 1 55.21850
## 2 55.20689
## 3 55.23499
## 4 55.30567
## 5 56.09910
## 6 57.85625
Let’s plot bikes per mil capita vs average used capacity.
ggplot(cities_df_ave,aes(ave_cap_h,per_mil_capita)) + geom_point() + geom_smooth(method = lm) + theme_minimal() + ggtitle("Bikes per mil capita and Average used capacity")
After this succinct analysis, it’s time to create our prediction model. The goal is to predict a station’s status (overflow, balanced or shortage) on a given weekday and hour.
df1 = ci_df[c("empty_slots","free_bikes","latitude","longitude","hour","empty_status","empty_status_label","id","name","shift","week_day")]
df1$empty_status = as.integer(df1$empty_status)
df1 = na.omit(df1)
str(df1)
## 'data.frame': 271950 obs. of 11 variables:
## $ empty_slots : int 5 20 9 13 15 7 6 16 13 3 ...
## $ free_bikes : int 11 0 9 22 8 19 12 0 5 21 ...
## $ latitude : num 51.5 51.5 51.5 51.5 51.5 ...
## $ longitude : num -0.137 -0.18 -0.11 -0.198 -0.121 ...
## $ hour : int 14 14 14 14 14 14 14 14 14 14 ...
## $ empty_status : int 31 100 50 37 65 26 33 100 72 12 ...
## $ empty_status_label: Factor w/ 3 levels "overflow","balanced",..: 2 3 2 2 2 2 2 3 2 1 ...
## $ id : chr "d79e95513c3b5e1d3c78a65874054481" "090f297e79c030688a8a51755ebeafe3" "7f3020118e56165ed8b2f61899edb971" "67e6c16bce05410ba4b1f0f5000726ea" ...
## $ name : chr "200178 - Buckingham Gate, Westminster" "001231 - Queen's Gate, Kensington Gardens" "001023 - River Street , Clerkenwell" "001018 - Phillimore Gardens, Kensington" ...
## $ shift : Factor w/ 3 levels "night","morning",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ week_day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
dim(df1)
## [1] 271950 11
Now I will normalize the data so it can be fed into the prediction model.
# overflow table
table(df1$empty_status_label)
##
## overflow balanced shortage
## 57173 126060 88717
# normalize function
normalize = function(x){
return ((x - min(x))/ (max(x) - min(x)))
}
df1_norm = as.data.frame(lapply(df1[1:5], normalize))
head(df1_norm)
## empty_slots free_bikes latitude longitude hour
## 1 0.07936508 0.1746032 0.4662308 0.4236568 0.5909091
## 2 0.31746032 0.0000000 0.4963503 0.2427163 0.5909091
## 3 0.14285714 0.1428571 0.7864481 0.5407341 0.5909091
## 4 0.20634921 0.3492063 0.4740658 0.1671494 0.5909091
## 5 0.23809524 0.1269841 0.7959171 0.4938113 0.5909091
## 6 0.11111111 0.3015873 0.4056110 0.3407064 0.5909091
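The min-max scaling can be sanity-checked on a toy vector: the minimum maps to 0, the maximum to 1, and everything else falls proportionally in between.

```r
# Same normalize() as above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
scaled <- normalize(c(2, 4, 6, 10))
scaled   # 0.00 0.25 0.50 1.00
```

Note that a column with a single constant value would give NaN (max equals min), which is worth checking before modeling.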
Basically, the model will classify a station depending on the number of empty slots, the number of available bikes, latitude, longitude and hour of the day. We split the dataset in two sets: 70% for training and 30% for testing.
# Split & model
n = nrow(df1_norm)
train = sample(1:n, size = round(0.7*n), replace=FALSE)
df1_norm_train = df1_norm[train,]
dim(df1_norm_train)
## [1] 190365 5
head(df1_norm_train)
## empty_slots free_bikes latitude longitude hour
## 72528 0.17460317 0.0952381 0.8007416 0.5559245 0.3636364
## 219241 0.23809524 0.2857143 0.9678386 0.9673329 0.3181818
## 36797 0.04761905 0.2698413 0.4078660 0.1791595 0.1363636
## 85251 0.26984127 0.0952381 0.4596363 0.5086424 0.5909091
## 210592 0.22222222 0.4920635 0.5693457 0.7324846 0.7272727
## 143920 0.09523810 0.2380952 0.7487335 0.3952582 0.2727273
df1_norm_test = df1_norm[-train,]
dim(df1_norm_test)
## [1] 81585 5
head(df1_norm_test)
## empty_slots free_bikes latitude longitude hour
## 2 0.31746032 0.00000000 0.4963503 0.2427163 0.5909091
## 3 0.14285714 0.14285714 0.7864481 0.5407341 0.5909091
## 4 0.20634921 0.34920635 0.4740658 0.1671494 0.5909091
## 10 0.04761905 0.33333333 0.7313622 0.4872924 0.5909091
## 15 0.03174603 0.36507937 0.7071568 0.5448459 0.5909091
## 18 0.15873016 0.06349206 0.7961175 0.4175939 0.5909091
df1_train_labels = df1$empty_status_label[train]
df1_test_labels = df1$empty_status_label[-train]
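Because `sample()` draws at random, the split above changes on every run. Fixing the RNG seed first makes it reproducible; a sketch on a toy size (the seed value 123 is arbitrary):

```r
set.seed(123)                                  # any fixed seed works
n <- 100                                       # toy size; nrow(df1_norm) in the real case
train <- sample(1:n, size = round(0.7 * n), replace = FALSE)
length(train)                                  # 70 training indices
length(setdiff(1:n, train))                    # 30 test indices
```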
Let’s train and test the model. I will use k-NN with k = 21.
# Training the model
library(class)
library(gmodels)
df1_test_pred_2 = knn(df1_norm_train,df1_norm_test,cl = df1_train_labels, k = 21)
CrossTable(df1_test_labels,df1_test_pred_2)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 81585
##
##
## | df1_test_pred_2
## df1_test_labels | overflow | balanced | shortage | Row Total |
## ----------------|-----------|-----------|-----------|-----------|
## overflow | 16548 | 629 | 0 | 17177 |
## | 47063.897 | 6741.591 | 5651.556 | |
## | 0.963 | 0.037 | 0.000 | 0.211 |
## | 0.974 | 0.017 | 0.000 | |
## | 0.203 | 0.008 | 0.000 | |
## ----------------|-----------|-----------|-----------|-----------|
## balanced | 435 | 36696 | 675 | 37806 |
## | 7023.865 | 21065.560 | 11125.514 | |
## | 0.012 | 0.971 | 0.018 | 0.463 |
## | 0.026 | 0.972 | 0.025 | |
## | 0.005 | 0.450 | 0.008 | |
## ----------------|-----------|-----------|-----------|-----------|
## shortage | 0 | 434 | 26168 | 26602 |
## | 5537.559 | 11459.181 | 34652.451 | |
## | 0.000 | 0.016 | 0.984 | 0.326 |
## | 0.000 | 0.011 | 0.975 | |
## | 0.000 | 0.005 | 0.321 | |
## ----------------|-----------|-----------|-----------|-----------|
## Column Total | 16983 | 37759 | 26843 | 81585 |
## | 0.208 | 0.463 | 0.329 | |
## ----------------|-----------|-----------|-----------|-----------|
##
##
df1_test = df1[-train,]
df1_test$label_pred = df1_test_pred_2
As you can see in the previous table, the model predicts quite well.
table(df1_test$empty_status_label,df1_test$label_pred)
##
## overflow balanced shortage
## overflow 16548 629 0
## balanced 435 36696 675
## shortage 0 434 26168
df1_test$pred = ifelse(df1_test$empty_status_label==df1_test$label_pred,1,0)
df1_test_wrong = df1_test[df1_test$pred == 0, ]
dim(df1_test_wrong)
## [1] 2173 13
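Overall accuracy follows directly from the confusion table: the diagonal (correct) counts divided by the total number of test rows.

```r
# Diagonal counts from the confusion table above
correct <- 16548 + 36696 + 26168
total   <- 81585
wrong   <- total - correct     # 2173, matching dim(df1_test_wrong)
correct / total                # about 0.973
```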
Finally, we can create a simple tool to look up a given station and see its real and predicted status on a given day and hour.
# Generate a vector with the ID/Names list
places = unique(df1_test$id)
length(places)
## [1] 778
places_names = unique(df1_test$name)
head(places_names,20)
## [1] "001231 - Queen's Gate, Kensington Gardens"
## [2] "001023 - River Street , Clerkenwell"
## [3] "001018 - Phillimore Gardens, Kensington"
## [4] "001022 - Brunswick Square, Bloomsbury"
## [5] "001008 - Hatton Wall, Holborn"
## [6] "003426 - Hampstead Road (Cartmel), Euston"
## [7] "001004 - Doric Way , Somers Town"
## [8] "003442 - Abingdon Villas, Kensington"
## [9] "000979 - Wenlock Road , Hoxton"
## [10] "003437 - Curzon Street, Mayfair"
## [11] "001030 - Lisson Grove, St. John's Wood"
## [12] "001017 - Great Dover Street, The Borough"
## [13] "000989 - Murray Grove , Hoxton"
## [14] "001064 - Old Street Station, St. Luke's"
## [15] "000978 - Russell Square Station, Bloomsbury"
## [16] "001082 - Borough Road, Elephant & Castle"
## [17] "001085 - Old Quebec Street, Marylebone"
## [18] "001055 - Wellington Road, St. John's Wood"
## [19] "000995 - Little Argyll Street, West End"
## [20] "000998 - Lollard Street, Vauxhall"
input_place = "001018 - Phillimore Gardens, Kensington"
input_hour = 8
input_weekday = "Mon"
selected_place = filter(df1_test,name == input_place & hour == input_hour & week_day == input_weekday)
table(selected_place$empty_status_label,selected_place$label_pred)
##
## overflow balanced shortage
## overflow 1 0 0
## balanced 0 0 0
## shortage 0 0 0
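The lookup above can be wrapped in a small helper. This `check_station()` function is hypothetical (not part of the chapter’s scripts); it uses base-R subsetting equivalent to the dplyr filter, and assumes the column names of df1_test as built above:

```r
# Hypothetical helper: confusion table for one station, day and hour
check_station <- function(df, place, input_hour, input_weekday) {
  sub <- df[df$name == place &
            df$hour == input_hour &
            df$week_day == input_weekday, ]
  table(sub$empty_status_label, sub$label_pred)
}
# check_station(df1_test, "001018 - Phillimore Gardens, Kensington", 8, "Mon")
```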